Search CORE

14 research outputs found

Leksikograafilise tarkvara Sketch Engine eesti keele moodul (The Estonian module of the lexicographic software Sketch Engine)

Author: Jürviste Madis
Kallas Jelena
Tuulik Maria
Publication venue: University of Tartu
Publication date: 01/01/2012

In autumn 2010, the Institute of the Estonian Language, together with the company Lexical Computing Ltd., began developing an Estonian module for the lexicographic software Sketch Engine (Kilgarriff et al. 2004). The article describes the program's main functions. The Word Sketch function (Estonian: sõnavisand) is covered in more detail. The article introduces the principles for compiling a word sketch grammar, examines separately the syntagmatic relations (i.e. grammatical and lexical collocations) presented in the word sketches of nouns, adjectives and verbs, and discusses how the module could be developed further. In addition, it analyses to what extent word sketches can be used to identify the sentence patterns of verbs

Journals from University of Tartu

Directory of Open Access Journals

State-of-the-art on monolingual lexicography for Estonia

Author: Jelena Kallas
Kristina Koppel
Margit Langemets
Maria Tuulik
Publication venue: University of Ljubljana
Publication date: 01/04/2019

The paper describes the state of the art of monolingual lexicography in Estonia. Firstly, we describe the current situation in Estonia and the main public functions performed by the Institute of the Estonian Language. Secondly, we provide an overview of the primary types of monolingual academic dictionaries (dictionaries of Standard Estonian and explanatory dictionaries) published in Estonia since the 20th century. Monolingual learner’s lexicography has emerged as a new field in the 2010s, focusing on basic vocabulary and collocations. Thirdly, we give a short overview of accessibility policy and availability of language resources for Estonian. Finally, we envisage the future work in the field of lexicography in the Institute. Within the framework of the new dictionary writing system Ekilex the Institute is moving away from presenting separate interfaces for different dictionaries towards a unified data model in order to provide the data in the aggregated form

Directory of Open Access Journals

Journals of Faculty of Arts, University of Ljubljana

Designing the ELEXIS Parallel Sense-Annotated Dataset in 10 European Languages

Author: Federico Martelli
Győrffy András
Jelena Kallas
Lipp Veronika
Polona Gantar
Roberto Navigli
Simon Krek
Simon László
Váradi Tamás
Publication venue: Lexical Computing
Publication date: 01/01/2021

Repository of the Academy's Library

Antiretroviral drug resistance and viral tropism in HIV-1 CRF06_cpx infected patients failing antiretroviral (ARV) therapy

Author: Ene-Ly Jõgeda
Eveli Kallas
Irja Lutsar
Jelena Smidt
Kristi Huik
Külliki Ainsalu
Lilia Novikova
Merit Pauskar
Radko Avi
Svetlana Semjonova
Tõnis Karki
Publication venue: Springer Science and Business Media LLC

Crossref

D3.8 Lexical-semantic analytics for NLP

Author: Campagnano Cesare
Costa Rute
de Does Jesse
Dobrovoljc Kaja
Frontini Francesca
Gantar Polona
Kallas Jelena
Koppel Kristina
Krek Simon
Langemets Margit
Martelli Federico
Maru Marco
Munda Tina
Navigli Roberto
Nimb Sanni
Olsen Sussi
Quochi Valeria
Salgado Ana de Castro
Tempelaars Rob
Tiberius Carole
Ureña-Ruiz Rafael-J.
Velardi Paola
Čibej Jaka
Publication venue: ELEXIS - European Lexicographic Infrastructure
Publication date: 01/01/2022

UIDB/03213/2020 UIDP/03213/2020

The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, investigating their role in NLP applications. Specifically, this task concentrates on three research directions, namely i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD), ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved, and finally iii) analysing the diachronic distribution of senses, for which a software package is made available.

Repositório da Universidade Nova de Lisboa

An insight into lexicographic practices in Europe: Results of the extended ELEXIS Survey on User Needs

Author: Kallas Jelena
Koeva Svetla
Kosem Iztok
Langemets Margit
Tiberius Carole
Publication venue: Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)
Publication date: 11/07/2022

The paper presents the results of a survey on lexicographic practices and lexicographers’ needs across Europe that was conducted in the context of the Horizon 2020 project European Lexicographic Infrastructure (ELEXIS) among the observer institutions of the project. The survey is a revised and upgraded version of the survey which was originally conducted among ELEXIS lexicographic partner institutions in 2018 (Kallas et al. 2019a). The main goal of this new survey was to complement the data from the ELEXIS lexicographic partner institutions in order to get a more complete picture of lexicographic practices both for born-digital and retro-digitised resources in Europe. The results offer a detailed insight into many aspects of the lexicographic process at European institutions, such as funding, training, staff, lexicographic expertise, software and tools. In addition, the survey reflects on current trends in lexicography and reveals what institutions see as the most important emerging trends that will affect lexicography in the short-term and long-term future. Overall, the results provide valuable input informing the development of tools, resources, guidelines and training materials within ELEXIS

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Publikationsserver des Instituts für Deutsche Sprache

The EKI Combined Dictionary 2022 (ELEXIS)

Author: Hein Indrek
Jürviste Madis
Kallas Jelena
Kiisla Olga
Koppel Kristina
Langemets Margit
Leemets Tiina
Mäearu Sirje
Paet Tiina
Päll Peeter
Raadik Maire
Risberg Lydia
Sai Edgar
Smirnova Alina
Tammik Hanna
Tavast Arvi
Tiits Mai
Tsepelina Katrin
Tubin Valentina
Tuulik Maria
Valdre Tiia
Viks Ülle
Publication venue: Institute of the Estonian Language
Publication date: 22/04/2022

Eesti Keele Ühendsõnastik 2022 (EKI Combined Dictionary 2022) displays information from different lexical databases: "The Dictionary of Estonian 2019", "Estonian Collocations Dictionary 2019", "Basic Estonian Dictionary" (2014), "The Estonian Morphological Database of the Institute of the Estonian Language 2022". It also displays information from bilingual lexical databases: "Estonian-Russian orthographic dictionary for students 2018" (1st edition 2011), "Estonian-Russian Dictionary 2018" (1st edition 1997–2009), "The Russian Morphological Database of the Institute of the Estonian Language 2022". The data is stored in Ekilex's PostgreSQL database and is accessible through an API. Ekilex is the in-house dictionary writing system (DWS) of the Institute of the Estonian Language. Ekilex is hosted in the Estonian Scientific Computing Infrastructure (ETAIS) cloud. See also: https://doi.org/10.15155/3-00-0000-0000-0000-08C0A

Common Language Resources and Technology Infrastructure - Slovenia

eTranslation TermBank: stimulating the collection of terminological resources for automated translation: Proceedings of the XVIII EURALEX International Congress

Copenhagen University Research Information System

Parallel sense-annotated corpus ELEXIS-WSD 1.0

Author: Costa Rute
Dobrovoljc Kaja
Frontini Francesca
Gantar Polona
Győrffy András
Kallas Jelena
Koeva Svetla
Koppel Kristina
Krek Simon
Langemets Margit
Lipp Veronika
László Simon
Martelli Federico
Monachini Monica
Munda Tina
Navigli Roberto
Nimb Sanni
Olsen Sussi
Quochi Valeria
Salgado Ana
Sancho-Sánchez José-Luis
Sandford Pedersen Bolette
Tempelaars Rob
Tiberius Carole
Ureña-Ruiz Rafael
Váradi Tamás
Üksik Tiiu
Čibej Jaka
Publication venue: Jožef Stefan Institute
Publication date: 28/07/2022

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, we filtered out sentences with fewer than 5 words and fewer than 2 polysemous words. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.
List of sense inventories
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene

The corpus is available in a CONLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).

Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

For more information, please refer to 00README.txt
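
The seven-column layout described above could be read along the following lines. This is a minimal sketch, not the corpus's official tooling. The column order follows the description, but several details are assumptions made here for illustration: that sentences are separated by blank lines, that lines starting with "#" are comments (as in CoNLL-U), and that empty cells are written as "_". The sample values, including the sense ID "bank.n.01" and the whitespace strings, are hypothetical. 00README.txt remains the authority on the actual encoding.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Token:
    token_id: str    # column 1: token ID
    form: str        # column 2: surface form
    lemma: str       # column 3: lemma
    upos: str        # column 4: UPOS tag
    whitespace: str  # column 5: whitespace information, kept as the raw string
    sense_id: str    # column 6: ID of the assigned sense
    mwe_index: str   # column 7: multiword expression index


def parse_sentences(text: str) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence.

    Assumes blank lines separate sentences and '#' lines are comments;
    neither convention is confirmed by the corpus description.
    """
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 tab-separated columns, got {len(fields)}")
        sentence.append(Token(*fields))
    if sentence:
        yield sentence


# Hypothetical sample data; every value is invented for illustration.
SAMPLE = (
    "1\tThe\tthe\tDET\tSpaceAfter=Yes\t_\t_\n"
    "2\tbank\tbank\tNOUN\tSpaceAfter=No\tbank.n.01\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t_\n"
)

sentences = list(parse_sentences(SAMPLE))
annotated = [t.lemma for t in sentences[0] if t.sense_id != "_"]
```

Keeping every column as a raw string avoids guessing at encodings the description does not specify, such as how the whitespace flag or a missing sense is written.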

Common Language Resources and Technology Infrastructure - Slovenia